Is there statistical evidence of fewer births on Feb 29 in the US?
An evaluation using 2000-2014 data
Leap day aversion? A 2012 calandar of births.
Is there statistical evidence of fewer births on Feb 29 in the US? The 2012 calendar below, showing number of birth as represented by color and bubble size, makes it apparent that there can be quite a lot of variation in the number of births from one day to another. For example December 25th’s recorded number of births is small compared with its neighbors, and births on the weekends appear to be much lower than their weekday counterparts.
With February 29th coming up this week, you might wonder, are there also relatively fewer births on February 29th, circled on the calendar. Feb 29th is a weird birthday to have – only truly ‘celebrated’ every four years. But it’s hard to draw any conclusions about a statistical evidence from this visualization and a single year of data. In what follows, we’ll assess the question, ‘Is there statistical evidence of fewer births on February 29th based on number of births in the U.S. over a 15 year period, 2000 to 2014?’
births_df %>%
filter(year == 2012) %>%
ggcalendar() +
aes(date = date) +
geom_point_calendar() +
aes(size = births) +
aes(color = births) +
geom_text_calendar(aes(label = day(date)),
color = "oldlace",
size = 2) +
guides(
colour = guide_legend("Births"),
size = guide_legend("Births")
) +
geom_point_calendar(data = data.frame(date =
as_date("2012-02-29")),
size = 7, color = "red", shape = 21, stroke = 2) +
scale_color_viridis_c() +
labs(title = "The year in 2012 in births")Zooming in on the week of February 29, 2012 in the figure below, it’s easier to appreciate the dip in number of births that is observed on February 29th. But the question remains, Is this variability usual, or is there an aversion to birthing on February 29th that is statistically detectable?
births_df |>
filter(date >= as.Date("2012-02-26") &
date <= as.Date("2012-03-03")) |>
ggplot() +
aes(x = date, y = births) +
geom_point() +
geom_line(linetype = "dashed") +
geom_label(aes(label = wday(date, label = T))) +
geom_point(data = . %>% filter(date == as.Date("2012-02-29")),
color = "red", shape = 1, size = 20, stroke = 2, alpha = .8) +
labs(title = "Births the week of Feb 29, 2012")I begin the analysis with a visual format that’s familiar to the audience, and where we can be pretty sure of the effects of relevant variables just by inspection
2000-2014 U.S. Births Data
The 15 years of data we’ll look at was available via the #TidyTuesday project – their October 2, 2018 dataset. Some data cleaning is required including constructing a full date variable from year, month and date_of_month variables. I also normalize the dates (to 2020) for superimposed within-year comparisons - importantly 2020 is a leap year so Feb 29th dates are valid and not dropped. I further include indicator variables that I anticipate to be important as well as an indicator variable that captures the condition of interest, ind_Feb_29th.
library(tidyverse)
births_path <- "https://raw.githubusercontent.com/EvaMaeRey/tableau/9e91c2b5ee803bfef10d35646cf4ce6675b92b55/tidytuesday_data/2018-10-02-us_births_2000-2014.csv"
library(ggcalendar)
readr::read_csv(births_path) %>%
filter(year != 2015) %>%
mutate(month = str_pad(month, 2, pad = "0"),
date_of_month = str_pad(date_of_month, 2, pad = "0")) %>%
mutate(date = paste(year, month, date_of_month, sep = "-") %>% as_date()) %>%
mutate(ind_holiday =
(month == "12" & date_of_month %in% 24:31) |
(month == "07" & date_of_month %in% c("04", "05")) |
(month == "01" & date_of_month %in% c("01", "02")) |
(month == "10" & date_of_month == "31") |
(month == "11" & date_of_month %in% 20:30) |
(month == "02" & date_of_month %in% 14)
) |>
mutate(date_in_2020 = paste(2020, month, date_of_month, sep = "-") %>% as_date()) |>
mutate(ind_weekend = wday(date) == 1 | wday(date) == 7) |>
mutate(ind_Feb_29th = month(date) == 2 & day(date) == 29) |>
mutate(ind_13th = day(date) == 13) |>
mutate(ind_Fri13th = wday(date) == 6 & day(date) == 13) ->
births_dfVisual exploration
Baseline visualization of the time series
Visualizing the U.S. births time series, we see that there is tremendous variability in number of birth per day. The standard deviation is 2326 in this time span, with the average number of births around 11400, as marked in the visualization below.
births_df |>
ggplot() +
aes(x = date, y = births) +
geom_point(size = .2) ->
time_series_base
add_y_mean <- function(){
list(ggxmean::geom_y_mean(linetype = "dotted"),
ggxmean::geom_y_mean_label() )
}
time_series_base +
add_y_mean()Univariate distribution
Looking at the univariate distribution of births per day, the bi-modal pattern is striking.
# ggchalkboard:::geoms_chalk_on()
ggplot(births_df) +
aes(x = births) +
geom_histogram() +
ggxmean::geom_x_mean() +
ggxmean::geom_x_mean_label() +
geom_rug(alpha = .2) +
labs(x = "Number of births") +
# ggchalkboard::theme_chalkboard() +
labs(title = "Distribution of Number of Births in the U.S. each day from 2000-2014" %>% str_wrap(45)) ->
univariate_base; univariate_baseExploring bi-modality with weekend indicator
Breaking this data up, we explore the hypothesis that preference for scheduled birth (inducements or c-sections) is for weekdays. In the figure below see that most of the bimodality is explained by weekend v. weekday effects. The difference in means for these groups is substantial - around 4650 difference in number of births!
Day of weeks effects
When we return to our time-series plot, but breaking the plots up by weekend, you may notice further bimodality within the ‘weekend’ subplot suggesting differences within the weekend for Saturday and Sunday.
time_series_base +
facet_wrap(~ind2cat::ind_recode(ind_weekend), ncol = 1) +
labs(title = "We explore the bimodal distribution looking at 'weekend effects'" %>% str_wrap(45))Following on that observation, we investigate by-weekday (Sunday, Monday, Tuesday etc) differences in number of births below. We look first at the histogram and then the time series plot.
We add a confidence interval around the mean based on a t-test.
time_series_base +
facet_wrap(~wday(date, label = T,
week_start = "Monday"), ncol = 2) +
labs(title = "Number of births 2000-2014 by day of week" %>% str_wrap(45))Holidays and outlying observations
The by-day-of-the-week exploration (and weekend vs. weekday) further exposes some outlying observations. We next explore if these lower-than-usual their counterparts are by and large holiday or holiday adjacent days. Due to time constraints, I’m relying a quick indicator variable that I created that include many holiday and adjacent dates.
last_plot() -> tplot
tplot[[2]] <- NULL
tplot +
geom_point(alpha = .5, size = .5) +
aes(color = ind2cat::ind_recode(ind_holiday)) +
labs(color = NULL) +
theme(legend.position = 'top', legend.justification = "left")This exploration suggests that holidays drive much of the out-of-character observations.
Outlyingness might also be due to aversion due to superstition - for example the 13th of each month, especially Fridays the 13th, and Halloween might be more outlying though not holidays.
Also, holiday adjacency might also lead to lower number of births - for example federal holidays that fall on the weekend are often observed on a Monday. The outliers that we observe on Monday are not categorized as holiday, but this likely due to the imperfect ind_holiday coding. President’s day was not captured in my coding, for example and always falls on a Monday.
Because February 29 is not a holiday or holiday-adjacent, it is appropriate to prune our data to relevant comparisons and try to remove likely holidays; this should lead to more precise estimates in our final analysis.
Due to time constraints and convenience, our pruning will be statistical instead of using substantive knowledge (i.e. using lists of federal and celebrated holidays, etc).
The approach is to prune observations for which a simple linear model’s error is greater than 3 standard deviations from the mean residual error (zero).
The linear model contain date and day of the week as categories.
To start, I visualize the parallel lines model fit.
compute_panel_lm_parallel <- function(data, scales){
model <- lm(y ~ x + category, data = data)
data |>
mutate(y = model$fitted)
}
StatParallel <- ggplot2::ggproto(`_class` = "StatParallel",
`_inherit` = ggplot2::Stat,
required_aes = c("x", "y", "category"),
compute_panel = compute_panel_lm_parallel,
default_aes = aes(color = after_stat(category)))
geom_ols_linear_parallel <- function(mapping = NULL, data = NULL,
position = "identity", na.rm = FALSE,
show.legend = NA,
inherit.aes = TRUE, ...) {
ggplot2::layer(
stat = StatParallel, # proto object from Step 2
geom = ggplot2::GeomLine, # inherit other behavior
data = data,
mapping = mapping,
position = position,
show.legend = show.legend,
inherit.aes = inherit.aes,
params = list(na.rm = na.rm, ...)
)
}
ggplot(births_df) +
aes(x = date, y = births,
color = wday(date, label = T), category = wday(date, label = T)) +
geom_point(alpha = .15) +
geom_ols_linear_parallel()Now, we estimate model outside of the visualization tool, and collect the residuals. Then we visualize the distribution of the residuals and mark the bound of 3 standard deviations from the mean, beyond which we’ll prune our dataset.
m <- lm(births ~ date + wday(date, label = T), data = births_df)
births_df$residuals_wday <- m$residuals
ggplot(births_df) +
aes(x = residuals_wday) +
geom_histogram() +
ggxmean::geom_x_mean() +
ggxmean:::geom_x3sd(linetype = "dashed")Dropping the large residuals results in 116 days dropped from our dataset.
births_df_pruned <- births_df |>
filter(abs(residuals_wday) <
3*sd(births_df$residuals_wday))
time_series_base %+% births_df_pruned +
facet_wrap(~wday(date, label = T), ncol = 2) +
labs(title = "Outlier pruned" %>% str_wrap(45)) In the remainder of the exploration and analysis, we’ll use the pruned version of the data, unless otherwise specified.
Seasonality and longer term trends
To look at the within-year periodicity, we use a normalized date variable, ‘date_in_2020’. We see summer and fall surges in number of births.
ggplot(births_df_pruned) +
aes(x = date_in_2020,
y = births) +
geom_point() +
aes(color = factor(year)) +
labs(color = NULL) +
facet_wrap(~wday(date, label = T)) +
facet_wrap(~ind2cat::ind_recode(ind_weekend), ncol = 2) +
geom_smooth() +
labs(title = "Using loess smoothing within year and weekend v. weekday" ) +
scale_x_date(labels = c("Jan", "April", "July", "Oct", "Jan")) +
labs(x = NULL)In the plot above, we also see differences in the rate of births by year. Exploring that more directly, we look at the distributions of number of daily birth by year (this is done with the unpruned data). Diamonds mark the mean for each year.
ggplot(births_df) +
aes(x = births, y = factor(year)) +
# geom_histogram() +
ggridges::geom_density_ridges() +
stat_summary(fun.y = mean, geom = "point", col = "goldenrod3",
size = 10, shape = "diamond") +
labs(x = "Daily births")While, we are here, we summarize to the total number of births for each year in this period in the U.S. That number hovers around 4 million.
ggplot(births_df) +
aes(x = year, weight = births/1000000) +
ggchalkboard:::theme_chalkboard() +
geom_bar(fill = "lightyellow", alpha = .75) +
stat_count(geom = "text",
linetype = "dashed",
aes(label = after_stat(round(count, 1))),
vjust = 1.4, size = 3,
color = "darkseagreen4") +
labs(y = "Number of births (millions)") +
labs(title = "Number of births in the U.S. (millions)") +
scale_x_continuous(breaks = 2000:2014, labels = c("2000", "", "", "","'04", "", "", "","'08", "", "", "","'12", "", "'14"))Modeling…
births_df_pruned
#> # A tibble: 5,363 × 13
#> year month date_of_month day_of_week births date ind_holiday
#> <dbl> <chr> <chr> <dbl> <dbl> <date> <lgl>
#> 1 2000 01 01 6 9083 2000-01-01 TRUE
#> 2 2000 01 02 7 8006 2000-01-02 TRUE
#> 3 2000 01 03 1 11363 2000-01-03 FALSE
#> 4 2000 01 04 2 13032 2000-01-04 FALSE
#> 5 2000 01 05 3 12558 2000-01-05 FALSE
#> 6 2000 01 06 4 12466 2000-01-06 FALSE
#> 7 2000 01 07 5 12516 2000-01-07 FALSE
#> 8 2000 01 08 6 8934 2000-01-08 FALSE
#> 9 2000 01 09 7 7949 2000-01-09 FALSE
#> 10 2000 01 10 1 11668 2000-01-10 FALSE
#> # ℹ 5,353 more rows
#> # ℹ 6 more variables: date_in_2020 <date>, ind_weekend <lgl>,
#> # ind_Feb_29th <lgl>, ind_13th <lgl>, ind_Fri13th <lgl>, residuals_wday <dbl>
births_base_model <- lm(formula =
births ~ factor(year) + month + factor(day_of_week),
data = births_df_pruned)
ggplot(fortify(births_base_model)) +
aes(x = .fitted, y = .resid) +
geom_point(alpha = .2) +
geom_hline(yintercept = 0, color = "darkred") +
geom_smooth()
births_base_model_feb29 <- lm(formula =
births ~ factor(year) + month + factor(day_of_week) + ind_Feb_29th,
data = births_df_pruned)
summary(births_base_model_feb29)
#>
#> Call:
#> lm(formula = births ~ factor(year) + month + factor(day_of_week) +
#> ind_Feb_29th, data = births_df_pruned)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2662.27 -224.20 9.72 227.49 2092.26
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 11718.19 31.98 366.443 < 2e-16 ***
#> factor(year)2001 -77.32 30.95 -2.499 0.012498 *
#> factor(year)2002 -111.48 30.95 -3.602 0.000318 ***
#> factor(year)2003 65.25 30.95 2.109 0.035032 *
#> factor(year)2004 83.22 30.90 2.693 0.007097 **
#> factor(year)2005 187.62 30.90 6.071 1.36e-09 ***
#> factor(year)2006 546.70 30.92 17.678 < 2e-16 ***
#> factor(year)2007 666.50 30.99 21.507 < 2e-16 ***
#> factor(year)2008 441.51 30.94 14.269 < 2e-16 ***
#> factor(year)2009 132.70 30.95 4.288 1.83e-05 ***
#> factor(year)2010 -233.42 30.95 -7.543 5.38e-14 ***
#> factor(year)2011 -377.26 30.90 -12.208 < 2e-16 ***
#> factor(year)2012 -397.76 30.92 -12.865 < 2e-16 ***
#> factor(year)2013 -457.70 30.95 -14.790 < 2e-16 ***
#> factor(year)2014 -359.70 30.95 -11.623 < 2e-16 ***
#> month02 137.82 28.04 4.915 9.15e-07 ***
#> month03 120.59 27.33 4.413 1.04e-05 ***
#> month04 16.46 27.55 0.598 0.550196
#> month05 271.60 27.56 9.857 < 2e-16 ***
#> month06 476.20 27.55 17.284 < 2e-16 ***
#> month07 831.77 27.54 30.206 < 2e-16 ***
#> month08 885.85 27.34 32.398 < 2e-16 ***
#> month09 1155.37 27.81 41.548 < 2e-16 ***
#> month10 371.83 27.33 13.606 < 2e-16 ***
#> month11 422.01 28.05 15.047 < 2e-16 ***
#> month12 442.50 27.76 15.942 < 2e-16 ***
#> factor(day_of_week)2 1039.03 21.32 48.737 < 2e-16 ***
#> factor(day_of_week)3 811.81 21.32 38.080 < 2e-16 ***
#> factor(day_of_week)4 849.50 21.42 39.663 < 2e-16 ***
#> factor(day_of_week)5 548.36 21.43 25.594 < 2e-16 ***
#> factor(day_of_week)6 -3591.86 21.24 -169.077 < 2e-16 ***
#> factor(day_of_week)7 -4635.93 21.25 -218.211 < 2e-16 ***
#> ind_Feb_29thTRUE -861.31 208.23 -4.136 3.58e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 414 on 5330 degrees of freedom
#> Multiple R-squared: 0.9677, Adjusted R-squared: 0.9675
#> F-statistic: 4991 on 32 and 5330 DF, p-value: < 2.2e-16Narrow to calendar peers
# ggchalkboard:::geoms_chalk_off()
births_df |>
filter(date_in_2020 >= as.Date("2020-02-20")) |>
filter(date_in_2020 <= as.Date("2020-03-13")) |>
ggplot() +
aes(x = date_in_2020, y = births) +
geom_line( alpha= .2) +
geom_point(aes(shape = ind2cat::ind_recode(ind_weekend),
fill = ind2cat::ind_recode(ind_Feb_29th),
size = ind2cat::ind_recode(ind_Feb_29th)),
shape = 21) +
geom_text(aes(label = wday(date, label = T)),
vjust = -0.52, size = 3) +
# geom_text(aes(label = births),
# vjust = 1.2) +
facet_wrap(~year) +
scale_size_discrete(range = c(4,6))
last_plot() +
facet_null() +
aes(group = paste(year, ind_weekend, isoweek(date))) +
aes(color = year)births_model_feb29_peers <- lm(formula =
births ~ factor(year) + date_in_2020 + factor(day_of_week) + ind_Feb_29th,
data = births_df_pruned |>
filter(date_in_2020 >= as.Date("2020-02-16")) |>
filter(date_in_2020 <= as.Date("2020-03-9")))
summary(births_model_feb29_peers)
#>
#> Call:
#> lm(formula = births ~ factor(year) + date_in_2020 + factor(day_of_week) +
#> ind_Feb_29th, data = filter(filter(births_df_pruned, date_in_2020 >=
#> as.Date("2020-02-16")), date_in_2020 <= as.Date("2020-03-9")))
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -997.5 -171.2 -9.9 178.3 1042.1
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -21789.586 42610.810 -0.511 0.609
#> factor(year)2001 -99.521 85.558 -1.163 0.246
#> factor(year)2002 -48.996 85.548 -0.573 0.567
#> factor(year)2003 113.185 85.558 1.323 0.187
#> factor(year)2004 138.110 84.382 1.637 0.103
#> factor(year)2005 96.470 85.519 1.128 0.260
#> factor(year)2006 477.589 85.508 5.585 5.09e-08 ***
#> factor(year)2007 734.570 85.558 8.586 4.40e-16 ***
#> factor(year)2008 559.862 84.381 6.635 1.44e-10 ***
#> factor(year)2009 384.811 85.548 4.498 9.69e-06 ***
#> factor(year)2010 -128.954 85.559 -1.507 0.133
#> factor(year)2011 -396.802 85.519 -4.640 5.14e-06 ***
#> factor(year)2012 -495.252 84.343 -5.872 1.11e-08 ***
#> factor(year)2013 -566.769 85.548 -6.625 1.53e-10 ***
#> factor(year)2014 -405.133 85.558 -4.735 3.33e-06 ***
#> date_in_2020 1.823 2.326 0.784 0.434
#> factor(day_of_week)2 1112.832 59.098 18.830 < 2e-16 ***
#> factor(day_of_week)3 988.488 58.821 16.805 < 2e-16 ***
#> factor(day_of_week)4 1034.869 58.743 17.617 < 2e-16 ***
#> factor(day_of_week)5 841.286 58.827 14.301 < 2e-16 ***
#> factor(day_of_week)6 -3204.244 58.736 -54.554 < 2e-16 ***
#> factor(day_of_week)7 -4258.234 58.818 -72.397 < 2e-16 ***
#> ind_Feb_29thTRUE -867.752 146.946 -5.905 9.22e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 285.9 on 311 degrees of freedom
#> Multiple R-squared: 0.9835, Adjusted R-squared: 0.9824
#> F-statistic: 844 on 22 and 311 DF, p-value: < 2.2e-16